Interactive Robot Education

Authors

  • Riad Akrour
  • Marc Schoenauer
  • Michèle Sebag
Abstract

Aimed at on-board robot training, an approach hybridizing active preference learning and reinforcement learning is presented: Interactive Bayesian Policy Search (IBPS) builds a robotic controller through direct and frugal interaction with the human expert, who iteratively emits preferences among a few behaviors demonstrated by the robot. These preferences allow the robot to gradually refine its policy utility estimate and select a new policy to be demonstrated, according to an Expected Utility of Selection criterion. The contribution of the paper concerns the handling of preference noise, due to the expert's mistakes or disinterest when the demonstrated behaviors are equally unsatisfactory. A noise model is proposed, enabling a resource-limited robot to soundly estimate the preference noise and maintain a robust interaction with the expert, thus enforcing a low sample complexity. A proof of principle of the IBPS approach, in simulation and on-board, is presented.

1 Position of the problem

Reinforcement learning (RL) [5, 19, 22], aimed at optimal sequential decision making, has been particularly investigated in robotics (see [12] for a survey). In practice, its success critically depends on the careful design of i) the state and action spaces; ii) the reward function. This paper focuses on the reward shaping issue: not only should the reward function reflect the target robot behavior; it should also induce a robust and tractable optimization problem. Some approaches, based on the demonstration of the target behavior by the expert and ranging from inverse reinforcement learning [17] to learning by imitation [6] or learning by demonstration [14], have been proposed to learn an appropriate reward function. Such approaches however require strong expertise in the solution of the task at hand. Alleviating this requirement, alternative approaches have been proposed since the late 2000s, based on the expert's preferences about the robot demonstrations, and automatically deriving a reward function [7], a posterior over the parametric policy space [24], or a policy utility estimate [2, 3].

The general limitations of preference-based RL are twofold. On the one hand, it should require little feedback from the expert (of the order of a few dozen preference judgments) to be effectively usable. On the other hand, it must resist the preference noise due to the expert's actual mistakes or disinterest, e.g. when the demonstrated robot behaviors are equally unsatisfactory. The contribution of the paper, the Interactive Bayesian Policy Search (IBPS) framework, simultaneously estimates the policy utility and the confidence thereof, through modelling the expert preference noise. The goal is to decrease the sample complexity of the interactive robot education process, that is, the number of preference judgments required to reach a satisfactory behavior. This goal is achieved on-board through a more robust selection of the policies to be demonstrated.

Notations. The standard notations of Markov Decision Processes are used in the rest of the paper [19, 22]. Inverse reinforcement learning (IRL), a.k.a. learning by imitation or apprenticeship learning, considers an MDP\r (an MDP whose reward function is unknown), a set of state sequences c_k = (s_0^(k), s_1^(k), s_2^(k), ..., s_(T_k)^(k)) describing expert trajectories, and a feature function φ(·) mapping the states (and hence the trajectories) onto ℝ^d, with u(c_k) = Σ_{h≥0} γ^h φ(s_h^(k)), where 0 < γ < 1 is the discount factor.
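To make the notation concrete, the following minimal sketch computes the discounted feature representation u(c_k) of a single trajectory; the NumPy helper, its name, and the toy feature map are illustrative assumptions, not part of the paper.

```python
import numpy as np

def trajectory_features(states, phi, gamma=0.95):
    """Discounted feature sum u(c_k) = sum_h gamma^h * phi(s_h) over one trajectory.

    states : sequence of states (here plain NumPy vectors)
    phi    : feature map sending a state to a vector in R^d
    gamma  : discount factor, 0 < gamma < 1
    """
    u = np.zeros_like(phi(states[0]), dtype=float)
    for h, s in enumerate(states):
        u += (gamma ** h) * phi(s)
    return u

# Toy usage: 2-D states with the identity feature map (an illustrative choice).
if __name__ == "__main__":
    phi = lambda s: np.asarray(s, dtype=float)
    trajectory = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 1.0])]
    print(trajectory_features(trajectory, phi, gamma=0.9))
```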
Assuming that the expert trajectories maximize some target reward r∗, the IRL goal is to find a policy with quasi-optimal performance under r∗ (rather than finding r∗ itself) [1]. Specific exploration mechanisms, e.g. based on Gibbs sampling [13], are used to overcome sub-optimal expert trajectories. [11] combines classification-based RL [15, 14] and IRL, formalizing IRL as a structured classification problem.

Preference learning-based RL was designed to alleviate the strong expertise assumption underlying the exploitation of human-demonstrated trajectories. In [7], the authors use preference learning in place of classification algorithms in rollout classification-based policy iteration (RCPI) [9]. The authors argue that action ranking is more flexible and robust than a supervised learning-based approach. In [2], the authors use preference learning to define a policy utility estimate; the main difference with [7] is that the former defines an order relation on the action space depending on the current state, whereas the latter defines an order relation on the trajectories.

Active preference-based policy learning [24, 3] exploits pairwise preferences among trajectories demonstrated by the agent, whereas IRL exploits trajectories demonstrated by the human expert. The main difference between [24] and [3] concerns the demonstrated trajectories: short trajectories are selected in [24], assuming that the initial states can be sampled from some prior distribution P(s_0) in order to enforce visiting interesting regions of the state space. No such selection of interesting excerpts is required in [3]; in counterpart, expert judgment errors might become more likely, as the expert is required to watch the whole trajectories. Formally, [24] infers a distribution over the parametric policy space. On the one hand, this setting enables the direct sampling of policies from the posterior. On the other hand, it makes it more expensive to update the posterior distribution after a preference constraint is added: computing the similarity between a parametric policy and the trajectories involved in a preference constraint requires a fair number of rollouts of the parametric policy. The policy selection in [3] is inspired by the expected utility of selection principle [23]. For tractability, a criterion based on the ranking-SVM objective value is used to estimate the utility of selecting a policy, conditionally to it being better/worse than the current best policy. However, the preference judgment noise is not taken into account.

The IBPS approach, building upon the APRIL framework [3], focuses on the policy robustness w.r.t. noisy expert preferences (section 2). A noise model is estimated on-board, enabling a limited-resource robot to cope with less-than-ideal experts. Under this model, the expected utility of selection is proved to be a good approximation of the expected posterior utility, the ideal but hardly tractable active selection criterion. The in-situ evaluation of IBPS establishes a proof of principle of the approach and sheds some light on the expert judgment errors (section 3). Their impact on the robot training (with and without noise modelling) is discussed in section 4.

2 Interactive Bayesian Policy Search

The IBPS algorithm elaborates on an active preference policy learning setting, inspired from [3] and distinguishing two search spaces. The first space X, referred to as the input or parametric space, is used to describe and sample policies (in the following, X = ℝ^D).
Policy π_x is represented as a vector x describing the state-action mapping (the subscript x will be omitted for readability when clear from the context). The second space, Φ(X), referred to as the feature or behavioral space, is used to describe a policy behavior or trajectory, and to learn the policy utility function. Formally, a robot trajectory generated from policy π_x, expressed as a state-action sequence (s_0, π_x(s_0), s_1, ..., π_x(s_{H−1}), s_H), is represented as a vector u ∈ ℝ^d, with u = Σ_{i=0}^{H} γ^i φ(s_i), 0 < γ ≤ 1, and φ a feature function mapping the state space (or the state × action space) onto ℝ^d. In the following, we shall restrict ourselves to linear trajectory utilities, where the utility of trajectory u is set to 〈w, u〉, with w a unit vector in ℝ^d. The π_x policy utility is the expectation of the trajectory utility over the trajectories generated from π_x.

Following the general framework described in [3, 24], active preference-based policy learning iterates a 3-step process: i) the learning agent demonstrates at least one new trajectory; ii) the expert expresses pairwise preferences between two newly demonstrated trajectories, or between a new trajectory and the former best trajectory (or his recollection thereof); iii) the policy utility estimate, that is, the robot model of the expert's preferences, is updated and a new policy is selected according to an active criterion.

After [3], the use of both the input/parametric and the feature/behavioral spaces is motivated by the expressiveness/tractability dilemma. On the one hand, a high-dimensional continuous search space is required to express competent policies; still, such a high-dimensional search space makes it difficult to learn a preference-based policy return from a moderate number of expert preferences. On the other hand, the behavioral space does enable a preference-based policy return to be learned from the little evidence provided by the expert, although the behavioral description is insufficient to describe a flexible policy.

2.1 Preference and noise model

Letting w∗ denote the true (hidden) utility of the expert, his preference judgment on a pair of trajectories {u, u′} is modeled as a noisy perturbation of his true preference 〈w∗, u − u′〉. The preference noise is usually modelled in the literature as a Gaussian perturbation [8, 24] or following the Luce-Shepard model [16, 18, 23]. In both cases, the noise model involves an extra parameter (respectively, the standard deviation of the Gaussian perturbation, or the temperature of the Luce-Shepard rule) controlling the magnitude of the noise. The noise model considered in IBPS involves a single scalar parameter δ ∈ ℝ, δ > 0, where the probability of the expert preferring u over u′ given the true preference z = 〈w∗, u − u′〉 is defined as:

P(u ≻ u′ | w∗, δ) = z/(2δ) + 1/2   if |z| < δ,
P(u ≻ u′ | w∗, δ) = 1 (resp. 0)   if z ≥ δ (resp. z ≤ −δ).

This simple piecewise-linear model allows IBPS to handle the uncertainty on the noise threshold δ through analytical integration over the δ distribution. A first option is to consider that the expert consistently answers according to a hidden but fixed δ∗. The second option assumes that δ can vary over time. Arguably, the former option is less flexible and less robust, as one abnormally large mistake can prevent the robot from identifying an otherwise consistent ranking model. Inversely, the latter option, while being more robust, could slow down the identification of the expert's utility function.
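For concreteness, here is a minimal sketch of the ramp-shaped preference probability defined above; only the formula itself comes from the text, while the function name, the toy vectors, and the random unit utility vector are illustrative assumptions.

```python
import numpy as np

def preference_probability(u, u_prime, w_star, delta):
    """P(u preferred to u') under the ramp noise model of Section 2.1.

    z = <w*, u - u'> is the true preference margin; the probability grows
    linearly over [-delta, delta] and saturates at 0 and 1 outside it.
    """
    z = float(np.dot(w_star, u - u_prime))
    if z >= delta:
        return 1.0
    if z <= -delta:
        return 0.0
    return z / (2.0 * delta) + 0.5

# Toy usage with made-up 3-D feature vectors and a random unit utility vector.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_star = rng.normal(size=3)
    w_star /= np.linalg.norm(w_star)   # w* is a unit vector, as in the text
    u, u_prime = rng.normal(size=3), rng.normal(size=3)
    print(preference_probability(u, u_prime, w_star, delta=0.5))
```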
This latter option is, however, clearly more appropriate to accommodate the cases where the task (or the expert's understanding thereof, or his preferences) might change over time. In the remainder of the paper, a distinct noise scale parameter δ_t is considered to model the expert preference noise after the t-th preference judgment has been emitted.

Let U_t = {u_0, u_1, ...; (u_{i_1} ≻ u_{i_2}), i = 1...t} denote the archive of all trajectories demonstrated to the expert and the expert's preference judgments up to the t-th iteration. Let us set a uniform prior p(w) over the unit sphere W of ℝ^d, and let us likewise assume that the prior over the noise scale parameter δ_i is uniform on the interval [0, M]. Given U_t, the posterior distribution of the utility function follows by Bayes' rule, combining the prior p(w) with the likelihood of the observed preference judgments under the above noise model.
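A sketch of what this posterior looks like under the stated assumptions (independent preference judgments, uniform p(w) on the unit sphere, and each δ_i uniform on [0, M]); this is a reconstruction, not the paper's own expression.

```latex
% Sketch of the posterior, assuming independent judgments, a uniform prior p(w)
% on the unit sphere, and each noise scale delta_i uniform on [0, M].
P(w \mid U_t) \;\propto\; p(w)\,
  \prod_{i=1}^{t} \frac{1}{M} \int_{0}^{M}
  P\!\left(u_{i_1} \succ u_{i_2} \mid w, \delta_i\right)\, d\delta_i
```

Each factor marginalizes the piecewise-linear likelihood of Section 2.1 over the uniform prior on the corresponding δ_i; the exact expression used in the paper may differ.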




Publication date: 2013